Skip to content

Conversation

@sophiatev
Copy link
Contributor

@sophiatev sophiatev commented Aug 20, 2025

AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync is not correctly catching the exception thrown in the case of a conflict when calling the ITrackingStore.UpdateStateAsync method. The InstanceStoreBackedTrackingStore implementation actually does not seem to throw any exceptions (it does an "unsafe" upsert, not using the passed in etag at all), while the AzureTableTrackingStore implementation wraps the currently-caught RequestFailedException exception in a DurableTaskStorageException, which is not being correctly caught. This PR adds this exception check to the method, and removes the no long relevant check for a RequestFailedException exception.

It also adds an additional check for the case of a HttpStatusCode.Conflict being returned - if this is the first work item for the orchestration, then there is no history and etag stored, and we follow this logic with action type TableTransactionActionType.Add. If the work item has already been completed, then the history already exists, and so the storage update fails with a HttpStatusCode.Conflict indicating that the table entity already exists. If this is not the first work item, then there is an etag stored, and the action type is instead TableTransactionActionType.UpdateMerge. In this case, if the work item has already been completed and a new etag stored, then the update fails with a HttpStatusCode.PreconditionFailed.

An example scenario that was previously not being caught, but is now correctly caught:

  1. Worker A obtains a lease on partition 1 and starts working on a work item. Say for whatever reason, worker A does not renew the work item in a timely fashion (which in practice means that the control queue messages for the item become visible again).
  2. Worker B steals the lease on partition 1 from worker A, dequeues the now-visible control queue messages from the partition, and starts working on the same work item.
  3. Worker B completes the work item and calls AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync.
  4. Worker A eventually finishes and also attempts to call AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync. If this is the first work item for the orchestration, then a DurableTaskStorageException with status code Conflict is thrown. If it is not the first work item, then the status code is PreconditionFailed. In either case, both of these exceptions are now caught and a SessionAbortedException is thrown, which is the intended behavior.

@sophiatev sophiatev changed the title Adding an Exception Case to CompleteTaskOrchestrationWorkItemAsync Adding an Exception Case to AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync Aug 20, 2025
@sophiatev sophiatev changed the title Adding an Exception Case to AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync Fixing the Exception Case for AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync Aug 26, 2025
@davidmrdavid
Copy link
Collaborator

davidmrdavid commented Aug 26, 2025

Update: seems this feedback has been incorporated, ty.

non-blocking tip (I'm just an OSS contributor now) - this seems like an involved failure case. I recommend adding in the PR description a short example of how this error case is reached, in the form of a timeline.

For example:
At

  • t=1, such and such happens.
  • t=2, then this happens
  • t=3, this happens, and now we're in an inconsistent state because Y
  • t=4, the inconsistency is observed, and the exception is thrown.

It might make it easier to determine that the PR is doing what's expected. Though others caught up with the context might not need this level of detail.

// Precondition failure is expected to be handled internally and logged as a warning.
// The orchestration dispatcher will handle this exception by abandoning the work item
throw new SessionAbortedException("Aborting execution due to failed precondition.", rfe);
throw new SessionAbortedException("Aborting execution due to conflicting completion of the work item.", dtse);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - if this is possibly customer facing, it's useful to add a little hint here that the error is benign. Consider:

Suggested change
throw new SessionAbortedException("Aborting execution due to conflicting completion of the work item.", dtse);
throw new SessionAbortedException("Aborting execution because the work item was already completed. This can occur when another worker took over processing, and is benign when it occurs rarely", dtse);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SessionAbortedExceptions do not surface, though someone else from the team correct me if I'm wrong, please.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't insist on this one, it's my personal style to augment these nonetheless but it really doesn't matter. Feel free to dismiss.

Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my comment changes, and the "partitionCount" log that should be generalized, the behavior here looks right to me.

Great job with the test, I personally don't think I ever had to write a test with such low level control over the orchestration processing state machine, so it's cool to see that in action.

As always - I'm just an OSS contributor, so please wait for someone actively in the team to approve before moving forward. Just wanted to signal that behavior-wise this seems correct, and that test seems complete.

Update: I cleared my approval just to avoid accidentally confusing the team into thinking the PR is already approved by a team member.

@davidmrdavid davidmrdavid self-requested a review August 27, 2025 15:59
@sophiatev sophiatev merged commit 3cb8d43 into main Sep 12, 2025
44 checks passed
@sophiatev sophiatev deleted the stevosyan/fix-wrong-exception-catching-for-etag-mismatch branch September 12, 2025 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants